High-Level Music Descriptor Extraction Algorithm Based on Combination of Multi-Channel CNNs and LSTM

نویسندگان

Ning Chen

Shijun Wang

چکیده

Although Convolutional Neural Networks (CNNs) and Long Short Term Memory (LSTM) have yielded impressive performances in a variety of Music Information Retrieval (MIR) tasks, the complementarity among the CNNs of different architectures and that between CNNs and LSTM are seldom considered. In this paper, multichannel CNNs with different architectures and LSTM are combined into one unified architecture (Multi-Channel Convolutional LSTM, MCCLSTM) to extract high-level music descriptors. First, three channels of CNNs with different shapes of filter are applied on each spectrogram image chunk to extract the pitch-, tempo-, and bassrelevant descriptors, respectively. Then, the outputs of each CNNs channel are concatenated and then passed through a fully connected layer to obtain the fused descriptor. Finally, LSTM is applied on the fused descriptor sequence of the whole track to extract its long-term structure property to obtain the high-level descriptor. To prove the efficiency of the MCCLSTM model, the obtained high-level music descriptor is applied to the music genre classification and emotion prediction task. Experimental results demonstrate that, when compared with the hand-crafted schemes or conventional deep learning (Multi Layer Perceptrons (MLP), CNNs, and LSTM) based ones, MCCLSTM achieves higher prediction accuracy on three music collections with different kinds of semantic tags.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Classification of Iranian Traditional Music Dastgahs Using Features Based on Pitch Frequency

The Iranian traditional music is composed of seven majors Dastgahs: Chahargah, Homayoun, Mahour, Segah, Shour, Nava, and Rast-Panjgah. In this paper, a new algorithm for the classification of the Iranian traditional music Dastgahs based on pitch frequency is proposed. In this algorithm, the features of Lagrange coefficients of pitch logarithm (LCPL), Fuzzy similarity sets type 2 (FSST2), and th...

متن کامل

From Sound to ‘Sense’ via Feature Extraction and Machine Learning: Deriving High-Level Descriptors for Characterising Music

Research in intelligent music processing is experiencing an enormous boost these days due to the emergence of the new application and research field of Music Information Retrieval (MIR). The rapid growth of digital music collections and the concomitant shift of the music market towards digital music distribution urgently call for intelligent computational support in the automated handling of la...

متن کامل

Recognition of Visual Events using Spatio-Temporal Information of the Video Signal

Recognition of visual events as a video analysis task has become popular in machine learning community. While the traditional approaches for detection of video events have been used for a long time, the recently evolved deep learning based methods have revolutionized this area. They have enabled event recognition systems to achieve detection rates which were not reachable by traditional approac...

متن کامل

A Novel Multicast Tree Construction Algorithm for Multi-Radio Multi-Channel Wireless Mesh Networks

Many appealing multicast services such as on-demand TV, teleconference, online games and etc. can benefit from high available bandwidth in multi-radio multi-channel wireless mesh networks. When multiple simultaneous transmissions use a similar channel to transmit data packets, network performance degrades to a large extant. Designing a good multicast tree to route data packets could enhance the...

متن کامل

Pseudo Zernike Moment-based Multi-frame Super Resolution

The goal of multi-frame Super Resolution (SR) is to fuse multiple Low Resolution (LR) images to produce one High Resolution (HR) image. The major challenge of classic SR approaches is accurate motion estimation between the frames. To handle this challenge, fuzzy motion estimation method has been proposed that replaces value of each pixel using the weighted averaging all its neighboring pixels i...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2017

High-Level Music Descriptor Extraction Algorithm Based on Combination of Multi-Channel CNNs and LSTM

نویسندگان

چکیده

منابع مشابه

Classification of Iranian Traditional Music Dastgahs Using Features Based on Pitch Frequency

From Sound to ‘Sense’ via Feature Extraction and Machine Learning: Deriving High-Level Descriptors for Characterising Music

Recognition of Visual Events using Spatio-Temporal Information of the Video Signal

A Novel Multicast Tree Construction Algorithm for Multi-Radio Multi-Channel Wireless Mesh Networks

Pseudo Zernike Moment-based Multi-frame Super Resolution

عنوان ژورنال:

اشتراک گذاری